Kernel density estimation: use the two data sets klient1.txt and klient3.txt. Plot the density distributions to compare them. The numbers represent the time of week (measured in hours from Sunday midnight) at which two different groups of people went shopping over an entire year. Choose an informative kernel and kernel width and justify your choice. Briefly characterise the two data sets (klient1 and klient3).
Solution: A kernel density estimate (KDE) is a nonparametric estimate of the density from which a data sample was drawn. KDE alleviates the problems that arise when we construct a histogram: a histogram depends on the bin width and on the bin end points, and it is not smooth. We can therefore use kernel density estimation to better determine the shape of the distribution.
Efficient use of KDE requires a good choice of the smoothing parameter, called the bandwidth of the kernel. A small bandwidth leads to an estimator with small bias and large variance (an undersmoothed curve). A large bandwidth leads to small variance at the expense of increased bias (an oversmoothed curve). The bandwidth therefore has to be chosen to balance the two.
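To make the role of the bandwidth concrete, a Gaussian KDE can be written directly from its definition, \(\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)\), and compared with R's density. This is a minimal sketch on synthetic data (not the klient files); kde_manual is a hypothetical helper name introduced only for this illustration.

```r
set.seed(7)
# Synthetic "hours" sample, used only for illustration
x <- rnorm(200, mean = 15, sd = 3)

# Gaussian KDE evaluated at the points x0 with bandwidth h
kde_manual <- function(x0, x, h) {
  sapply(x0, function(p) mean(dnorm((p - x) / h)) / h)
}

grid <- c(10, 15, 20)
manual <- kde_manual(grid, x, h = 1)

# density() evaluates the same estimate on a regular grid (with a binned
# approximation), so we interpolate its output at the same points
d <- density(x, bw = 1, kernel = "gaussian")
builtin <- approx(d$x, d$y, xout = grid)$y

print(cbind(grid, manual, builtin))
```

The two columns should agree to several decimal places, which shows that bw in density is the standard deviation of the Gaussian kernel.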
We need to plot the kernel density estimates for the two data sets klient1.txt and klient3.txt, which represent the time of week (measured in hours from Sunday midnight) at which two different groups of people went shopping over an entire year.
R provides several methods for selecting the bandwidth. With the function density we can choose between different bandwidth selection methods and different types of smoothing kernels.
From the images we can see the density estimation for the data sets Klient 1 and Klient 3 using Gaussian, Triangular and Rectangular Kernel and three different methods for bandwidth calculation:
nrd, which uses Scott's rule (factor \(1.06\)). The bandwidth for Klient 1 is \(5.055997\) and for Klient 3 is \(5.086402\).
nrd0, which uses Silverman's rule of thumb and can be calculated with the formula \(0.9*\min(sd(Klient1), IQR(Klient1)/1.34)*length(Klient1)^{(-\frac{1}{5})}\). The bandwidth for Klient 1 is \(4.292827\) and for Klient 3 is \(4.318643\), which is \(0.9/1.06\) times the nrd value in both cases.
bcv, which applies biased cross-validation. The bandwidth for Klient 1 is \(5.355643\) and for Klient 3 is \(5.245306\).
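The three selectors listed above are available in R as bw.nrd, bw.nrd0 and bw.bcv. The following sketch runs them on a synthetic sample (an assumption standing in for klient1.txt, which is not reproduced here) and checks the rule-of-thumb formula by hand.

```r
set.seed(1)
# Synthetic stand-in for the shopping hours (0..168), not the real data
x <- c(rnorm(300, mean = 40, sd = 3), rnorm(300, mean = 110, sd = 4))

bw_scott     <- bw.nrd(x)    # Scott's variation, factor 1.06
bw_silverman <- bw.nrd0(x)   # Silverman's rule of thumb, factor 0.9
bw_bcv       <- bw.bcv(x)    # biased cross-validation (may warn at range ends)

# Silverman's rule computed by hand
spread <- min(sd(x), IQR(x) / 1.34)
bw_manual <- 0.9 * spread * length(x)^(-1 / 5)

print(c(nrd = bw_scott, nrd0 = bw_silverman, bcv = bw_bcv, manual = bw_manual))
```

Because both rules use the same spread estimate, nrd0 is always \(0.9/1.06 \approx 0.849\) times nrd, which matches the ratio of the values reported above.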
We get the best results for both Klient 1 and Klient 3 with the Gaussian kernel. However, the bandwidth has to be set manually: the built-in methods return values between roughly \(4.3\) and \(5.4\), which are larger than the values that work best here and therefore oversmooth the data. The bandwidths we settled on are \(2\) for Klient 1 and \(1\) for Klient 3.
If we overlay the density plots of the two data sets, we can see that the distribution for Klient 3 is narrower than that of Klient 1.
The picture below shows examples of oversmoothed and undersmoothed plots.
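The effect of oversmoothing and undersmoothing can also be seen numerically by counting the local maxima of the estimate. The sketch below uses a synthetic bimodal sample (an assumption, not the klient data); count_modes is a hypothetical helper introduced for this illustration.

```r
set.seed(42)
# Synthetic sample with two true modes, 25 units apart
x <- c(rnorm(400, mean = 60, sd = 4), rnorm(400, mean = 85, sd = 4))

# Number of strict local maxima of a density estimate
count_modes <- function(d) {
  turning <- diff(sign(diff(d$y)))
  sum(turning == -2)
}

d_under <- density(x, bw = 0.3)  # undersmoothed: many spurious bumps
d_mid   <- density(x, bw = 2)    # intermediate bandwidth
d_over  <- density(x, bw = 15)   # oversmoothed: the two modes merge

modes_under <- count_modes(d_under)
modes_over  <- count_modes(d_over)

plot(d_under, main = "Effect of the bandwidth", col = "grey60")
lines(d_mid, lwd = 2)
lines(d_over, lty = 2)
```

A very small bandwidth produces many spurious peaks, while a very large one hides the two real peaks of the sample.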
Extract from the above data the information about Fridays and Saturdays. Plot the four density plots (2 sets of clients, 2 days) as in task 1, overlaying them. Select two different distributions (out of the four) and make a Q-Q plot to compare them. Interpret the Q-Q plot and describe how the distributions differ.
Solution:
The data represent the time of week measured in hours from Sunday midnight. We can extract the data for Friday and Saturday by selecting the values between \(96\) and \(120\) for Friday and between \(120\) and \(144\) for Saturday. If we rescale the data to a 24-hour scale and plot the densities, we get the following plot:
To obtain the density plots, a Gaussian kernel and Scott's rule (nrd) are used.
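The extraction and rescaling step can be sketched as follows. The helper extract_day and the half-open interval convention (lower bound included, upper bound excluded) are assumptions made for this illustration; the input vector is made up.

```r
# day_index counts whole days since the origin of the hour scale,
# so Friday is 4 (hours 96..120) and Saturday is 5 (hours 120..144)
extract_day <- function(hours, day_index) {
  lower <- 24 * day_index
  in_day <- hours >= lower & hours < lower + 24
  hours[in_day] - lower   # rescale to 0..24
}

# Made-up example values on the weekly hour scale
hours <- c(5.5, 97.25, 119.9, 120, 130.5, 143.99, 150)

friday   <- extract_day(hours, 4)
saturday <- extract_day(hours, 5)

print(friday)
print(saturday)
```

The resulting vectors can be passed to density(..., bw = "nrd") and overlaid with plot and lines.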
A normal Q-Q plot is used to check whether a set of observations is approximately normally distributed, in which case the points fall approximately on a straight line. The Q-Q plots for our data sets do not form a straight line, which means that their distributions are not normal.
We can also use a Q-Q plot to compare the data from Saturday (Klient 1) with the data from Saturday (Klient 3). In the range from 13 to 17, the values for Klient 1 are smaller than those of Klient 3, while from approximately 19 to 24 the values of Klient 1 are larger than those of Klient 3. From this we can conclude that the two samples do not appear to come from a common distribution, because the points do not form a straight line. The plot is also S-shaped, indicating that one of the distributions is more skewed than the other, or that one of the distributions has heavier tails than the other.
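A two-sample Q-Q plot of this kind can be drawn with qqplot. The sketch below uses two synthetic samples of different sizes (assumptions standing in for the Saturday data of the two client groups) and adds the line \(y = x\) as a reference.

```r
set.seed(3)
# Synthetic stand-ins for the Saturday hours of two client groups
sat_a <- rnorm(500, mean = 15, sd = 3)
sat_b <- rnorm(300, mean = 16, sd = 2)

# qqplot() pairs the sorted quantiles of the two samples; when the sizes
# differ, the larger sample is interpolated down to the smaller size
qq <- qqplot(sat_a, sat_b,
             xlab = "Sample A quantiles (hours)",
             ylab = "Sample B quantiles (hours)")
abline(0, 1, lty = 2)   # points on this line mean equal quantiles

print(head(data.frame(a = qq$x, b = qq$y)))
```

Points above the dashed line mean that the second sample has the larger value at that quantile; a systematic departure from the line indicates that the two distributions differ in location, spread or shape.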
Study the data product_time_shop.txt. There is information about a few products, shops and times of purchases through the week. Describe the data: which products, which shops, how many purchases of different products in different shops, and which periods are covered?
Solution:
The data consist of \(4\) columns that represent the date and time of the purchase, the shop in which the product was sold, and the product.
Date: This feature consists of \(11\) different dates on which a product was sold. The type of the feature is factor.
Time: This continuous feature shows the time when a purchase was made. The type of the feature is integer.
Product: This feature shows the product (Banana, Coffee_Cream, Eggs_1, Eggs_2, Grapes, Milk_1, Milk_2, Sour_Cream_1, Sour_Cream_2, Vastlakukkel, Whipped_Cream). The type of the feature is factor.
Shop_id: This feature shows the shop (3, 4, 18, 21, 32). The type of the feature is integer. Around 35% of the purchases were made in shop 4.
Using R we can add an additional column Day that converts the date to a day of the week.
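# Read the purchase data
sh <- read.csv("C:/Users/arsov/OneDrive/Documents/product_time_shop-txt.csv")
# Derive the weekday name from the date column
sh$day <- weekdays(as.Date(sh$date))
# Number of purchases per weekday
table(sh$day)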
##
##    Friday    Monday  Saturday    Sunday   Tuesday Wednesday
##     23948      9649     32260     20460     38174     14001
We can also calculate the number of products sold in the different shops. This is shown in the following tables:
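Counts of this kind can be obtained with a two-way table. The sketch below uses a small made-up data frame; the column names product and shop_id are assumptions about how the columns are named in the real file.

```r
# Made-up purchases, for illustration only
sh_demo <- data.frame(
  product = c("Banana", "Banana", "Grapes", "Milk_1", "Banana", "Grapes"),
  shop_id = c(4, 4, 3, 18, 21, 3)
)

# Rows are products, columns are shops, cells are purchase counts
counts <- table(sh_demo$product, sh_demo$shop_id)
print(counts)

# Row and column totals
print(addmargins(counts))
```

On the real data the same call would be applied to the corresponding columns of sh.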
Draw violin plots and/or boxplots (preferably overlaying them) that would allow comparing different weekdays, shops, and product sales. Identify some meaningful illustrations to draw conclusions about 1) different weekdays, 2) different products, 3) shops. State your hypothesis and then draw respective analysis of data.
Solution:
The box plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. A violin plot is similar to a box plot, with a rotated kernel density plot added on each side that shows the probability density of the data at different values. The box plots that compare the purchase time across products, shops (shop_id) and days are shown in the following plot.
The overlaid box plots and violin plots are shown in the following images.
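The numbers behind a box plot can be inspected directly. The following sketch uses ten made-up purchase hours; it also shows that R's boxplot draws the whiskers only to the most extreme points within 1.5 times the interquartile range of the box, and marks anything beyond that as an outlier.

```r
# Made-up purchase hours, with one unusually early purchase at 01:00
hours <- c(1, 12, 13, 14, 15, 16, 17, 18, 19, 20)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(hours)
print(five)

# What boxplot() actually draws: whisker ends, box and median, plus outliers
bs <- boxplot.stats(hours)
print(bs$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)     # points drawn individually as outliers

boxplot(hours, horizontal = TRUE, xlab = "Hour of day")
```

Here the box spans 13 to 18, so that is where the middle half of the purchases lies, and the purchase at 01:00 is drawn as a separate point rather than as the end of the lower whisker.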
From the box plots we can see, for example, that the middle half of banana purchases (the box) falls between \(13:04\) and \(18:20\); the earliest time at which a banana was bought was \(7:32\) and the latest \(23:10\). Similarly, the most common time for buying Coffee Cream is between \(12:22\) and \(17:26\). Overall, the most common time for buying products is between \(12:30\) and \(17:00\), although this is not true for all products.
From the box plot that compares the days and the time, we can see that people most commonly bought products between \(13:33\) and \(18:27\) on Friday, \(13:24\) and \(18:43\) on Monday, \(12:47\) and \(17:43\) on Saturday, etc. The earliest time at which someone bought a product is \(7:32\) on Monday.
From the box plot that compares the shops and the time we can see, for example, that people most commonly bought products between \(12:53\) and \(18:20\) in shop 18.
Using a bar plot we can visualize on which day and in which shop the most bananas were sold.
From this plot we can see that the most bananas were sold on Saturday in shop 4 and the fewest on Wednesday in shop 21. If we create a subset of the bananas sold in shop 4, we can see that the mean time at which they were sold is \(15:46\).
We can group and plot the time periods when bananas were sold in different shops, on different dates and on different days.
From this plot we can see that the largest number of bananas was sold on Saturday at noon and the smallest number on Wednesday night.
From this plot we can see that shop 4 sells more bananas than the other shops, and that it sells the largest number of bananas in the afternoon.
Using this plot we can conclude that the most bananas were sold on \(2014-10-31\) in the afternoon.
Use the same data as in 3. Explore the data and identify if any of the shops has run out of any popular product during the day (which shops, products, days?). Find some visualisation to convince the reader or shop manager. Formulate the principles of an automated procedure to identify all such events across entire supermarket(s).
Solution:
One way to solve this problem is to plot the products sold for each day of the week and for each shop. From the plots we can see on which day of the week the sales of a product drop or disappear. This can be an indication that the shop has run out of that particular product.
Because we do not have any data for Thursdays, we assume that the shops are closed on that day. Some of the conclusions we can draw from the plots are:
Bananas: Banana sales on Wednesday are unusually small compared with the other days. We can therefore assume that the shops have run out of bananas on Wednesdays.
Coffee_Cream and Eggs_1: Shops have run out of Coffee Cream and Eggs 1 on Monday.
Eggs_2: Eggs_2 are sold in only two shops. Shop 3 has run out of eggs on Wednesday and shop 4 on Monday.
Grapes: The sales of grapes are unusually small on Wednesday in shop 3. We can assume that shop 3 has run out of grapes on that day.
Milk_1: People mostly buy this product during the weekend. As a result, most of the shops have run out of milk on Monday.
Milk_2: There is no sudden drop in the sales of this product, so we cannot conclude whether the shops have run out of it.
Sour_Cream_1 and Sour_Cream_2: The shops run out of sour cream on Mondays.
Vastlakukkel: The data we have for this product come from only one day, Tuesday 2014-03-04. Therefore we cannot use this method to draw conclusions about this product.
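The visual check described above could be automated along the following lines. This is only a sketch under stated assumptions: the function name flag_low_sales, the column names product, shop_id and day, and the threshold of 25% of the median daily count are all hypothetical choices, and the data are made up.

```r
# Flag product/shop/day combinations whose sales count is far below the
# typical (median) daily count for the same product in the same shop
flag_low_sales <- function(sales, threshold = 0.25) {
  counts <- as.data.frame(
    table(product = sales$product, shop_id = sales$shop_id, day = sales$day),
    responseName = "n", stringsAsFactors = FALSE
  )
  counts$n <- as.numeric(counts$n)
  # Median count over the days, per product and shop
  counts$typical <- ave(counts$n, counts$product, counts$shop_id, FUN = median)
  counts[counts$typical > 0 & counts$n < threshold * counts$typical, ]
}

# Made-up purchases: bananas in shop 4 sell 10 to 12 times a day,
# except on Wednesday, when only one sale is recorded
demo <- data.frame(
  product = "Banana",
  shop_id = 4,
  day = rep(c("Monday", "Tuesday", "Wednesday", "Friday"),
            times = c(10, 12, 1, 11))
)

flagged <- flag_low_sales(demo)
print(flagged)
```

A flagged combination is only a candidate: a low count can also reflect low demand on that day, so each flagged case should still be checked against the plots.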
Solution:
From the Q-Q plots for Klient 1 and Klient 3 we can conclude that we are not sampling from normally distributed data. The S shape of the plots indicates that the tails differ from those of a normal distribution: the largest and the smallest values are not as extreme as a normal distribution would predict, which corresponds to light (short) tails rather than fat tails.
If we compare Klient 1 and Klient 3, we can see that they do not appear to come from a common distribution, because the points do not form a straight line. Periodically Klient 3 has larger values than Klient 1, and then the values of the two data sets get closer again.
As in the previous plots, we can conclude that we are not sampling from normally distributed data. The S shape of the plot again indicates that the largest and the smallest values are not as extreme as a normal distribution would predict, which corresponds to light (short) tails rather than fat tails.